Michael L. Davies
mld9s
What can we learn about conflict environments from text analysis of social media posts? I assume that social media posts from within a conflict zone reflect the environment. Of course, sentiments will likely ebb and flow, but they are likely more negative overall. Additionally, I wonder whether the text can predict (or is associated with) particular events on the ground.
For this project, I leveraged data maintained by The Armed Conflict Location & Event Data Project (ACLED). I narrowed the scope to the Syrian conflict, which has persisted for more than a decade. Since 2017, ACLED has collected more than 80,000 social media and open-source posts. The posts are then tagged by a curator with the date, province, city, and various associated events, such as which actor (the Syrian Army or a non-state armed group) gained territory.
Note: because I'm interested in data science (rather than anthropological) conventions, I chose to diverge from the project description in a few ways.
Interestingly, and possibly because we didn't leverage standard Python libraries for this class, I found R much cleaner in its approach and pipelines, and I found much more interesting results with R.
Section One: Pre-processing and building dataframes
Section Two: Sentiment Analysis
Section Three: Topic Modeling
Section Four: Cluster Analysis
Section Five: Classification - conventional pipelines
The implementation in R is included in separate files
Beyond the analysis of parts of speech and word frequencies…
Sentiments/polarity:
Polarity counts vary across provinces. This is to be expected: the conflict has taken on different characteristics in different provinces.
In addition, polarity has varied widely over time, and differentially by province.
The dominant emotions have been "fear", "anger", and "sadness". These emotions ebbed and flowed over time, but always outpaced "trust" and "joy" — regardless of which province we look at. This is later supported by the VADER approach.
An important caveat with respect to sentiments: the word "opposition" appears to be coded in the lexicon as negative. However, it's not always clear that the sentence itself expresses negativity.
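To see why token-level lexicon lookups can misfire this way, consider a minimal sketch. The three-word lexicon here is invented for illustration (it is not the NRC lexicon itself): any sentence containing "opposition" is pushed negative regardless of context.

```python
# Toy lexicon for illustration only -- the real NRC lexicon is far larger
toy_lexicon = {"opposition": -1, "killed": -1, "peace": 1}

def score(sentence):
    # Sum per-token polarities; tokens missing from the lexicon score 0.
    # The context around each token is ignored, which is the point.
    return sum(toy_lexicon.get(tok.lower(), 0) for tok in sentence.split())

print(score("The opposition held a press conference"))  # -1, despite neutral context
```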
Topic Modeling
The text did not warrant a large number of topics, so I set the number of topics in Python to 10. (With R, I lowered the number of potential topics to 6.)
A heat map of the topic scores shows, once again, that the prominent topics vary by province.
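The heat map is simply a province-by-topic matrix of mean topic weights rendered with a color gradient. A minimal sketch with invented weights (the real values come from the LDA THETA table built later in the notebook):

```python
import pandas as pd

# Hypothetical per-document topic weights tagged with province
theta = pd.DataFrame({
    "admin1": ["Aleppo", "Aleppo", "Idleb", "Idleb", "Homs", "Homs"],
    0: [0.7, 0.5, 0.1, 0.2, 0.3, 0.3],
    1: [0.3, 0.5, 0.9, 0.8, 0.7, 0.7],
})

# Average topic weight per province -- the matrix behind the heat map;
# in a notebook, heat.style.background_gradient() renders the colors
heat = theta.groupby("admin1").mean()
print(heat)
```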
Classification
My primary interest was: can we use text from social media to predict the outcome of a battle? As such, my response variable is binary: the social media text is associated with either "Syrian regime regains territory" or "Non-state actor gains territory", recoded as 1 and 0 respectively.
I first filter the data to only those cases coded according to the response variable. (Many posts are associated only with various types of clashes and battles, with no turnover of terrain.) Fortunately, the data is balanced across labels, so no up/down-sampling is required and accuracy is sufficient for evaluating the models.
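The filtering and recoding step can be sketched as follows. The mini-frame and label strings are illustrative stand-ins for the ACLED columns, not the actual data:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the ACLED event data
events = pd.DataFrame({
    "sub_event_type": [
        "Syrian regime regains territory", "Non-state actor gains territory",
        "Armed clash", "Syrian regime regains territory",
        "Non-state actor gains territory", "Armed clash",
    ],
    "notes": ["text a", "text b", "text c", "text d", "text e", "text f"],
})

label_map = {
    "Syrian regime regains territory": 1,
    "Non-state actor gains territory": 0,
}

# Keep only the two outcome categories and recode them to 1/0
labeled = events[events.sub_event_type.isin(label_map)].copy()
labeled["y"] = labeled.sub_event_type.map(label_map)
print(labeled.y.value_counts())  # balanced: two of each label
```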
I then compete multiple classification algorithms against each other.
I begin with the simplest model: sklearn's feature-extraction package handles the text preprocessing before sending it to the logistic regression. Additionally, I import the module for the train/test split (set at 75/25).
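A minimal sketch of that baseline, using an invented toy corpus in place of the ACLED notes (this shows only the pipeline shape, not the actual project code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy corpus; 1 = regime regains territory, 0 = non-state actor gains
texts = ["regime forces advanced and regained the town",
         "rebels captured the checkpoint from regime forces",
         "the army retook the village after clashes",
         "opposition fighters seized the district"] * 5
labels = [1, 0, 1, 0] * 5

# 75/25 train/test split, as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# CountVectorizer handles tokenization and counting before the classifier
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```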
In Python, I compete three classification approaches.
First, I leverage spaCy for preprocessing and add sklearn's TF-IDF module for classification. Second, I use a baseline model, the simplest, from sklearn without TF-IDF. Last, I implemented a Keras neural network; Keras is a high-level API that sits on top of TensorFlow.
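The model competition can be sketched with sklearn alone. Note two substitutions: spaCy preprocessing is replaced by the vectorizers' default tokenization, and sklearn's MLPClassifier stands in for the Keras network, so this shows only the shape of the comparison, not the actual implementations:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Invented toy corpus standing in for the ACLED notes
texts = ["regime forces advanced and regained the town",
         "rebels captured the checkpoint from regime forces",
         "the army retook the village after clashes",
         "opposition fighters seized the district"] * 5
labels = [1, 0, 1, 0] * 5
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

models = {
    "counts + logreg (baseline)": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=1000)),
    "tfidf + logreg": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "tfidf + small neural net": make_pipeline(
        TfidfVectorizer(), MLPClassifier(hidden_layer_sizes=(16,),
                                         max_iter=1000, random_state=0)),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
print(scores)
```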
In the end, prediction was extremely robust, with all models achieving accuracy between 92 and 94 percent. Interestingly, Keras (TensorFlow) was not the most accurate.
Findings in R
Given that this was not required for the project, I won't elaborate on the findings here. However, I was able to implement a much more sophisticated treatment of unigrams and bigrams. I also conducted a network analysis of words associated with each label, a (cleaner) topic analysis, and frequencies of co-occurrence with words of interest. Last, I conducted a bootstrapped logistic regression to predict the labels, and plotted a variable-importance plot to show which words (unigrams or bigrams) were most important in predicting the response.
# Importing modules
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import plotly_express as px
%matplotlib inline
import os
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('tagsets') #part-of-speech tagging
# read in data
df_raw = pd.read_csv('df_acled_syr_2017_2021.csv', encoding = 'utf-8')
print(df_raw.shape)
(85104, 25)
# select columns of interest
df = df_raw[['country', 'year', 'event_date', 'admin1', 'admin2',
'event_type', 'sub_event_type', 'fatalities',
'actor1', 'assoc_actor_1', 'actor2', 'assoc_actor_2',
'notes', 'longitude', 'latitude', 'event_id_cnty']].sample(1000).\
reset_index()
df.rename(columns = {'event_id_cnty':'token_id'},inplace = True)
df.event_date = pd.to_datetime(df.event_date, errors='coerce')
df_copy = df.copy()
print(df.shape)
(1000, 17)
df_copy = df_copy[['token_id', 'country', 'year', 'event_date', 'admin1', 'admin2','notes']]
df_copy.head()
| token_id | country | year | event_date | admin1 | admin2 | notes | |
|---|---|---|---|---|---|---|---|
| 0 | SYR24162 | Syria | 2017 | 2017-02-27 | Ar-Raqqa | Ar-Raqqa | US Global Coalition warplanes bombarded areas ... |
| 1 | SYR60990 | Syria | 2019 | 2019-05-21 | Hama | As-Suqaylabiyah | On 21 May, 2019, regime warplanes carried out ... |
| 2 | SYR81969 | Syria | 2020 | 2020-09-09 | Ar-Raqqa | Ar-Raqqa | On 9 September 2020, an unknown armed group sh... |
| 3 | SYR63026 | Syria | 2019 | 2019-06-26 | Hama | Muhradah | On 26 June, 2019, Russian warplanes carried ou... |
| 4 | SYR51193 | Syria | 2018 | 2018-12-29 | Ar-Raqqa | Tell Abiad | Movement of forces: A Global Coalition convoy ... |
# set configs for OHCO
OHCO = ['country', 'year', 'event_date', 'admin1', 'admin2','token_id', 'token_num']
DATE = OHCO[:3]
ADMIN1 = OHCO[:4]
SENTS = OHCO[:6]
df = df.groupby(OHCO[:6]).notes.apply(lambda x: '\n'.join(x)).to_frame()
df.index.names = OHCO[:6]
df.head()
| notes | ||||||
|---|---|---|---|---|---|---|
| country | year | event_date | admin1 | admin2 | token_id | |
| Syria | 2017 | 2017-01-01 | Dar'a | Dar'a | SYR43 | Syrian regime forces shelled areas in Yadodeh ... |
| Homs | Homs | SYR9 | Syrian regime forces shelled places in the vil... | |||
| 2017-01-02 | Aleppo | Jebel Saman | SYR23081 | The Syrian army shelled the rebel factions pos... | ||
| Homs | Homs | SYR34496 | Syrian regime forces artillery shelled Al Maha... | |||
| Idleb | Harim | SYR60 | Unidentified warplane targeted a car in Sarmad... |
With unprocessed data
token_freq = pd.Series(' '.join(df['notes'])\
.split())\
.value_counts()[:20]\
.to_frame().reset_index()\
.rename(columns= {'index': 'token', 0: 'count'})
# Top 20 tokens by frequency
# https://pythonbasics.org/seaborn-barplot/
plt.figure(figsize=(15,8))
sns.barplot(x='token', y='count', data = token_freq,\
palette = 'viridis')\
.set_title('Top 20 Tokens by Frequency',
fontsize = 18)
plt.xticks(rotation=45);
# Word Cloud using tokens from df.notes
# https://re-thought.com/creating-wordclouds-in-python/
text = " ".join(word for word in df.notes)
wordcloud = WordCloud(background_color="green",
width=800,
height=400)\
.generate(text)
plt.figure(figsize=(15,12))
plt.imshow(wordcloud)
plt.axis("off")
plt.show();
from nltk import word_tokenize, pos_tag, pos_tag_sents
import pandas as pd
OHCO = ['country', 'year', 'event_date', 'admin1', 'admin2','token_id']
df_copy.head()
| token_id | country | year | event_date | admin1 | admin2 | notes | |
|---|---|---|---|---|---|---|---|
| 0 | SYR24162 | Syria | 2017 | 2017-02-27 | Ar-Raqqa | Ar-Raqqa | US Global Coalition warplanes bombarded areas ... |
| 1 | SYR60990 | Syria | 2019 | 2019-05-21 | Hama | As-Suqaylabiyah | On 21 May, 2019, regime warplanes carried out ... |
| 2 | SYR81969 | Syria | 2020 | 2020-09-09 | Ar-Raqqa | Ar-Raqqa | On 9 September 2020, an unknown armed group sh... |
| 3 | SYR63026 | Syria | 2019 | 2019-06-26 | Hama | Muhradah | On 26 June, 2019, Russian warplanes carried ou... |
| 4 | SYR51193 | Syria | 2018 | 2018-12-29 | Ar-Raqqa | Tell Abiad | Movement of forces: A Global Coalition convoy ... |
df_copy = df_copy.set_index(OHCO)
df_copy.head()
| notes | ||||||
|---|---|---|---|---|---|---|
| country | year | event_date | admin1 | admin2 | token_id | |
| Syria | 2017 | 2017-02-27 | Ar-Raqqa | Ar-Raqqa | SYR24162 | US Global Coalition warplanes bombarded areas ... |
| 2019 | 2019-05-21 | Hama | As-Suqaylabiyah | SYR60990 | On 21 May, 2019, regime warplanes carried out ... | |
| 2020 | 2020-09-09 | Ar-Raqqa | Ar-Raqqa | SYR81969 | On 9 September 2020, an unknown armed group sh... | |
| 2019 | 2019-06-26 | Hama | Muhradah | SYR63026 | On 26 June, 2019, Russian warplanes carried ou... | |
| 2018 | 2018-12-29 | Ar-Raqqa | Tell Abiad | SYR51193 | Movement of forces: A Global Coalition convoy ... |
def tokenize(doc_df, OHCO=OHCO, remove_pos_tuple=False, ws=False):
# Sentences to Tokens
# Local function to pick tokenizer
def word_tokenize(x):
if ws:
s = pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x)))
else:
s = pd.Series(nltk.pos_tag(nltk.word_tokenize(x)))
return s
df = doc_df.notes\
.apply(word_tokenize)\
.stack()\
.to_frame()\
.rename(columns={0:'pos_tuple'})
# Grab info from tuple
df['pos'] = df.pos_tuple.apply(lambda x: x[1])
df['token_str'] = df.pos_tuple.apply(lambda x: x[0])
if remove_pos_tuple:
df = df.drop('pos_tuple', axis=1)
# Add index
# df.index.names = OHCO
return df
TOKEN = tokenize(df_copy)
TOKEN = TOKEN[~TOKEN.pos.str.match('^NNP')]
TOKEN['term_str'] = TOKEN['token_str'].str.lower().str.replace(r'[\W_]', '', regex=True)
TOKEN.head()
| pos_tuple | pos | token_str | term_str | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| country | year | event_date | admin1 | admin2 | token_id | |||||
| Syria | 2017 | 2017-02-27 | Ar-Raqqa | Ar-Raqqa | SYR24162 | 0 | (US, PRP) | PRP | US | us |
| 1 | (Global, JJ) | JJ | Global | global | ||||||
| 3 | (warplanes, NNS) | NNS | warplanes | warplanes | ||||||
| 4 | (bombarded, VBD) | VBD | bombarded | bombarded | ||||||
| 5 | (areas, NNS) | NNS | areas | areas |
# count unique terms with frequencies and build the VOCAB dataframe
VOCAB = TOKEN.term_str.value_counts().to_frame().rename(columns={'index':'term_str', 'term_str':'n'})\
.sort_index().reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'
VOCAB['num'] = VOCAB.term_str.str.match(r"\d+").astype('int')
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw.dummy = 1
VOCAB['stop'] = VOCAB.term_str.map(sw.dummy)
VOCAB['stop'] = VOCAB['stop'].fillna(0).astype('int')
VOCAB[VOCAB.stop == 1].sample(10)
| term_str | n | num | stop | |
|---|---|---|---|---|
| term_id | ||||
| 1300 | to | 239 | 0 | 1 |
| 1294 | through | 1 | 0 | 1 |
| 1280 | that | 26 | 0 | 1 |
| 977 | over | 33 | 0 | 1 |
| 244 | an | 160 | 0 | 1 |
| 1289 | this | 2 | 0 | 1 |
| 550 | during | 25 | 0 | 1 |
| 900 | most | 1 | 0 | 1 |
| 643 | for | 33 | 0 | 1 |
| 1283 | them | 13 | 0 | 1 |
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
VOCAB['p_stem'] = VOCAB.term_str.apply(stemmer.stem)
VOCAB = VOCAB.dropna()
TOKEN = TOKEN.dropna()
TOKEN['term_id'] = TOKEN.term_str.map(VOCAB.reset_index().set_index('term_str').term_id)
TOKEN.head()
| pos_tuple | pos | token_str | term_str | term_id | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| country | year | event_date | admin1 | admin2 | token_id | ||||||
| Syria | 2017 | 2017-02-27 | Ar-Raqqa | Ar-Raqqa | SYR24162 | 0 | (US, PRP) | PRP | US | us | 1343 |
| 1 | (Global, JJ) | JJ | Global | global | 674 | ||||||
| 3 | (warplanes, NNS) | NNS | warplanes | warplanes | 1370 | ||||||
| 4 | (bombarded, VBD) | VBD | bombarded | bombarded | 337 | ||||||
| 5 | (areas, NNS) | NNS | areas | areas | 256 |
VOCAB['pos_max'] = TOKEN.groupby(['term_id', 'pos'])\
.count()\
.iloc[:,0]\
.unstack()\
.idxmax(1)
VOCAB.head()
| term_str | n | num | stop | p_stem | pos_max | |
|---|---|---|---|---|---|---|
| term_id | ||||||
| 0 | 3331 | 0 | 0 | . | ||
| 1 | 01 | 2 | 1 | 0 | 01 | CD |
| 2 | 02 | 7 | 1 | 0 | 02 | CD |
| 3 | 03 | 4 | 1 | 0 | 03 | CD |
| 4 | 04 | 4 | 1 | 0 | 04 | CD |
POS = TOKEN.pos.value_counts()\
.to_frame()\
.rename(columns={'pos':'n'})
POS.index.name = 'pos_id'
POS.sort_values('n').plot.bar(y='n', figsize=(15,5), rot=45);
POS_MAX = VOCAB.pos_max.value_counts().to_frame().rename(columns={'pos_max':'n'})
POS_MAX.index.name = 'pos_id'
VOCAB.groupby('pos_max').n.sum().sort_values().plot.bar(figsize=(15,5), rot=45);
POS['max_n'] = POS_MAX['n']
POS.head()
| n | max_n | |
|---|---|---|
| pos_id | ||
| IN | 4402 | 36.0 |
| NN | 3933 | 417.0 |
| NNS | 3662 | 205.0 |
| DT | 2126 | 14.0 |
| JJ | 1862 | 246.0 |
# import plotly_express as px
fig = px.scatter(POS.reset_index(), x='n', y='max_n', text='pos_id')
fig.data[0].update(mode='text')
fig.show()
new_rank = VOCAB.n.value_counts()\
.sort_index(ascending=False).reset_index().reset_index()\
.rename(columns={'level_0':'term_rank2', 'index':'n', 'n':'nn'})\
.set_index('n')
VOCAB['term_rank'] = VOCAB.n.map(new_rank.term_rank2) + 1
VOCAB['p'] = VOCAB.n / TOKEN.shape[0]
VOCAB.head()
| term_str | n | num | stop | p_stem | pos_max | term_rank | p | |
|---|---|---|---|---|---|---|---|---|
| term_id | ||||||||
| 0 | 3331 | 0 | 0 | . | 1 | 0.124701 | ||
| 1 | 01 | 2 | 1 | 0 | 01 | CD | 110 | 0.000075 |
| 2 | 02 | 7 | 1 | 0 | 02 | CD | 105 | 0.000262 |
| 3 | 03 | 4 | 1 | 0 | 03 | CD | 108 | 0.000150 |
| 4 | 04 | 4 | 1 | 0 | 04 | CD | 108 | 0.000150 |
px.scatter(VOCAB[VOCAB.term_rank <= 10000],
x='term_rank', y='n',
log_y=False,
log_x=False,
hover_name='term_str',
color='pos_max')
VOCAB.plot.scatter('term_rank', 'n', figsize=(10,8));
VOCAB.plot.scatter('term_rank', 'n',
figsize=(10,8),
logx=True,
logy=True);
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB.head()
| term_str | n | num | stop | p_stem | pos_max | term_rank | p | |
|---|---|---|---|---|---|---|---|---|
| term_id | ||||||||
| 0 | 3331 | 0 | 0 | . | 1 | 0.124701 | ||
| 1 | 01 | 2 | 1 | 0 | 01 | CD | 110 | 0.000075 |
| 2 | 02 | 7 | 1 | 0 | 02 | CD | 105 | 0.000262 |
| 3 | 03 | 4 | 1 | 0 | 03 | CD | 108 | 0.000150 |
| 4 | 04 | 4 | 1 | 0 | 04 | CD | 108 | 0.000150 |
count_method = 'n' # 'c' or 'n' # n = n tokens, c = distinct token (term) count
tf_method = 'raw' # sum, max, log, double_norm, raw, binary
tf_norm_k = .5 # only used for double_norm
idf_method = 'standard' # standard, max, smooth
gradient_cmap = 'GnBu' # YlGn, GnBu, YlGnBu; For tables; see https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html
bag = ADMIN1
# recall, we chose ADMIN1 (province) as the bag
# group tokens by bag + term id, count them (yielding a series), convert to a frame, rename the column
BOW = TOKEN.groupby(bag+['term_id'])\
.term_id.count()\
.to_frame()\
.rename(columns={'term_id':'n'})
# create an indicator (1 if the term occurs in the bag), needed for a matrix later
BOW['c'] = BOW.n.astype('bool').astype('int')
BOW.head()
| n | c | |||||
|---|---|---|---|---|---|---|
| country | year | event_date | admin1 | term_id | ||
| Syria | 2017 | 2017-01-01 | Dar'a | 0 | 2 | 1 |
| 99 | 1 | 1 | ||||
| 256 | 1 | 1 | ||||
| 645 | 1 | 1 | ||||
| 751 | 2 | 1 |
# convert into a count matrix
# use "n" or "c" - (count_method) established at the top
# unstack takes the last level of the index and projects it across the columns
DTCM = BOW[count_method].unstack().fillna(0).astype('int')
DTCM.head(10)
| term_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1406 | 1407 | 1408 | 1409 | 1410 | 1411 | 1412 | 1413 | 1414 | 1415 | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| country | year | event_date | admin1 | |||||||||||||||||||||
| Syria | 2017 | 2017-01-01 | Dar'a | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Homs | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||
| 2017-01-02 | Aleppo | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| Homs | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||
| Idleb | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||
| Rural Damascus | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||
| 2017-01-04 | Rural Damascus | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| 2017-01-05 | Rural Damascus | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| 2017-01-06 | Hama | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| 2017-01-07 | Hama | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 1416 columns
print('TF method:', tf_method)
if tf_method == 'sum':
TF = DTCM.T / DTCM.T.sum()
elif tf_method == 'max':
TF = DTCM.T / DTCM.T.max()
elif tf_method == 'log':
TF = np.log10(1 + DTCM.T)
elif tf_method == 'raw':
TF = DTCM.T
elif tf_method == 'double_norm':
TF = DTCM.T / DTCM.T.max()
TF = tf_norm_k + (1 - tf_norm_k) * TF[TF > 0]
elif tf_method == 'binary':
TF = DTCM.T.astype('bool').astype('int')
TF = TF.T
TF method: raw
# Compute DF
DF = DTCM[DTCM > 0].count()
# Compute IDF
N = DTCM.shape[0]
print('IDF method:', idf_method)
if idf_method == 'standard':
IDF = np.log10(N / DF)
elif idf_method == 'max':
IDF = np.log10(DF.max() / DF)
elif idf_method == 'smooth':
IDF = np.log10((1 + N) / (1 + DF)) + 1
### Compute TF-IDF
TFIDF = TF * IDF
IDF method: standard
VOCAB.head()
| term_str | n | num | stop | p_stem | pos_max | term_rank | p | |
|---|---|---|---|---|---|---|---|---|
| term_id | ||||||||
| 0 | 3331 | 0 | 0 | . | 1 | 0.124701 | ||
| 1 | 01 | 2 | 1 | 0 | 01 | CD | 110 | 0.000075 |
| 2 | 02 | 7 | 1 | 0 | 02 | CD | 105 | 0.000262 |
| 3 | 03 | 4 | 1 | 0 | 03 | CD | 108 | 0.000150 |
| 4 | 04 | 4 | 1 | 0 | 04 | CD | 108 | 0.000150 |
VOCAB['df'] = DF
VOCAB['idf'] = IDF
VOCAB['tfidf_mean'] = TFIDF[TFIDF > 0].mean().fillna(0) # mean tf-idf over only the bags where the term appears
VOCAB['tfidf_sum'] = TFIDF.sum()
VOCAB['tfidf_median'] = TFIDF[TFIDF > 0].median().fillna(0) # median, likewise restricted to bags containing the term
VOCAB['tfidf_max'] = TFIDF.max()
BOW['tf'] = TF.stack()
BOW['tfidf'] = TFIDF.stack()
VOCAB.head()
| term_str | n | num | stop | p_stem | pos_max | term_rank | p | df | idf | tfidf_mean | tfidf_sum | tfidf_median | tfidf_max | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| term_id | ||||||||||||||
| 0 | 3331 | 0 | 0 | . | 1 | 0.124701 | 944 | 0.000460 | 0.001623 | 1.531641 | 0.001379 | 0.039084 | ||
| 1 | 01 | 2 | 1 | 0 | 01 | CD | 110 | 0.000075 | 2 | 2.674402 | 2.674402 | 5.348804 | 2.674402 | 2.674402 |
| 2 | 02 | 7 | 1 | 0 | 02 | CD | 105 | 0.000262 | 7 | 2.130334 | 2.130334 | 14.912336 | 2.130334 | 2.130334 |
| 3 | 03 | 4 | 1 | 0 | 03 | CD | 108 | 0.000150 | 4 | 2.373372 | 2.373372 | 9.493487 | 2.373372 | 2.373372 |
| 4 | 04 | 4 | 1 | 0 | 04 | CD | 108 | 0.000150 | 4 | 2.373372 | 2.373372 | 9.493487 | 2.373372 | 2.373372 |
px.scatter(VOCAB, x='term_rank',
y='tfidf_mean',
hover_name='term_str',
hover_data=['n'], color='pos_max',
log_x=False, log_y=False)
px.scatter(VOCAB, x='term_rank',
y='tfidf_sum',
hover_name='term_str',
hover_data=['n'],
color='pos_max')
import pandas as pd
import numpy as np
import seaborn as sns
from IPython.core.display import display, HTML
sns.set()
%matplotlib inline
salex = pd.read_csv('salex_nrc.csv').set_index('term_str')
salex.columns = [col.replace('nrc_','') for col in salex.columns]
salex['polarity'] = salex.positive - salex.negative
#salex.head()
# a subset of emotions to select just these columns
emo_cols = "anger anticipation disgust fear joy sadness surprise trust polarity".split()
TOKEN = TOKEN.join(salex, on = 'term_str', how='left')
TOKEN[emo_cols] = TOKEN[emo_cols].fillna(0)
TOKEN.groupby('token_id')\
.polarity.value_counts()\
.unstack(0)\
.plot.barh()\
.get_legend().remove()
#plot polarity for all provinces
TOKEN.groupby('admin1')\
.polarity.value_counts()\
.unstack(0).plot.barh()
plt.title('Syria: Polarity counts by province');
plt.xlabel('');
plt.ylabel('Polarity counts');
plt.xticks(rotation=0, horizontalalignment='right');
plt.rcParams['figure.figsize'] = [14,12] #[width, height]
#plt.show()
#plt.savefig("FatalitiesByProvince.jpg")
# Create a bar plot to see the polarity values by Province.
plt.figure(figsize=(10,5))
sns.barplot(x='polarity',
y='admin1',
data = TOKEN.reset_index(),\
palette = 'rocket')\
.set_title('Polarity by Province',fontsize = 18)
plt.xticks(rotation=45);
# get the 1st and 2nd derivative to smooth the lines
TOKEN['new_polarity'] = TOKEN.polarity.diff()
TOKEN['growth_new_polarity'] = TOKEN.new_polarity.diff()
TOKEN['mov_avg'] = TOKEN['new_polarity'].rolling(3).sum()
# w = int(TOKEN.polarity[0] / 10)
# TOKEN[['polarity']].rolling(w).mean().plot(figsize=(25,5));
fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.lineplot(
x="event_date",
y="mov_avg",
hue="admin1",
data=TOKEN,
ci= None)
ax.set(xlabel="Date",
ylabel="Polarity",
title="Syria: Polarity counts over time by Province")
# Define the date format
date_form = DateFormatter("%m-%y")
ax.xaxis.set_major_formatter(date_form)
# Place a major tick every 30 weeks
ax.xaxis.set_major_locator(mdates.WeekdayLocator(interval=30))
TOKEN[emo_cols].mean().sort_values().plot.barh(cmap='viridis')\
.set_title('Distribution of Emotions - OHCO Method',fontsize = 18);
admin_emo_cols = TOKEN.groupby('admin1')[emo_cols].mean()
def plot_sentiments(df, emo='polarity'):
FIG = dict(figsize=(25, 10), legend=True,
fontsize=14, rot=45, xticks=())
df[emo].plot(**FIG).axes.xaxis.set_visible(False)
plot_sentiments(admin_emo_cols, ['fear','anger','sadness','trust', 'joy'])
plot_sentiments(admin_emo_cols, ['fear','anger','sadness','trust'])
plot_sentiments(admin_emo_cols, ['polarity'])
emo = 'polarity'
TOKEN['html'] = TOKEN\
.apply(lambda x: "<span class='sent{}'>{}</span>"\
.format(int(np.sign(x[emo])), x.token_str), 1)
#TOKEN['html'].sample(10)
# Now at sentence level
sents_polarity = TOKEN.groupby(SENTS)[emo_cols].mean()
sents_polarity['sent_str'] = TOKEN\
.groupby(SENTS).term_str.apply(lambda x: x.str.cat(sep=' '))
sents_polarity['html_str'] = TOKEN\
.groupby(SENTS).html.apply(lambda x: x.str.cat(sep=' '))
# a function to create some html to visualize the results
# NOTE - this just pulls a sample(10)
def sample_sentences(df):
rows = []
for idx in df.sample(10).index:
valence = round(df.loc[idx, emo], 4)
t = 0
if valence > t: color = '#ccffcc'
elif valence < t: color = '#ffcccc'
else: color = '#f2f2f2'
rows.append("""<tr style="background-color:{0};padding:.5rem 1rem;font-size:110%;">
<td>{1}</td><td>{3}</td><td width="400" style="text-align:left;">{2}</td>
</tr>""".format(color, valence, df.loc[idx, 'html_str'], idx))
display(HTML('<style>#sample1 td{font-size:120%;vertical-align:top;} .sent-1{color:red;font-weight:bold;} .sent1{color:green;font-weight:bold;}</style>'))
display(HTML('<table id="sample1"><tr><th>Sentiment</th><th>ID</th><th width="600">Sentence</th></tr>'+''.join(rows)+'</table>'))
sample_sentences(sents_polarity)
| Sentiment | ID | Sentence |
|---|---|---|
| 0.0 | ('Syria', 2019, Timestamp('2019-11-30 00:00:00'), 'Aleppo', "A'zaz", 'SYR71506') | On 30 2019 , Turkish forces shelled in . Neither injuries nor fatalities were reported . |
| -0.0435 | ('Syria', 2019, Timestamp('2019-03-05 00:00:00'), 'Hama', 'Muhradah', 'SYR53567') | On 5 of 2019 , regime forces shelled and its outskirts in with artillery shells . Neither injuries nor fatalities were reported . |
| -0.0256 | ('Syria', 2019, Timestamp('2019-12-12 00:00:00'), 'Damascus', 'Damascus', 'SYR72687') | Arrests : On 12 2019 , members of the regime 's arrested four women from the villages of in for unknown reasons . The women were arrested in the vicinity of neighborhood checkpoint and at the checkpoint in . |
| 0.0 | ('Syria', 2018, Timestamp('2018-05-27 00:00:00'), 'Hama', 'Muhradah', 'SYR33378') | Regime forces fired rocket shells at the town of al-Latamna with no injuries or fatalities reported . |
| -0.0204 | ('Syria', 2017, Timestamp('2017-12-18 00:00:00'), 'Idleb', "Al Ma'ra", 'SYR17528') | The Syrian army , backed by Russian airstrikes , managed to control village in the southern countryside of after clashes with al-Sham and rebel fighters . However , the clashes are still ongoing in an attempt by to regain control over the village . No fatalities were reported . |
| -0.0526 | ('Syria', 2017, Timestamp('2017-08-29 00:00:00'), 'Deir-ez-Zor', 'Deir-ez-Zor', 'SYR44140') | Syrian regime forces carried out airstrikes on the village of in the governorate of , killing one civilian . |
| 0.0 | ('Syria', 2020, Timestamp('2020-02-20 00:00:00'), 'Hama', 'As-Suqaylabiyah', 'SYR75790') | On 20 , 2020 , Russian warplanes carried out airstrikes on in in countryside . Neither injuries nor fatalities were reported . |
| 0.0 | ('Syria', 2017, Timestamp('2017-11-28 00:00:00'), 'Aleppo', 'Jebel Saman', 'SYR16811') | The forces targeted the positions of the Turkish forces in The western countryside of , no casualties were reported . |
| -0.0435 | ('Syria', 2020, Timestamp('2020-06-17 00:00:00'), 'Aleppo', 'Jebel Saman', 'SYR79398') | On 17 2020 , Turkish forces shelled regime forces positions west of city with artillery . Neither injuries nor fatalities were reported . |
| 0.0 | ('Syria', 2018, Timestamp('2018-01-18 00:00:00'), 'Idleb', "Al Ma'ra", 'SYR42467') | Russian warplanes conducted an airstrike on the town of in . No casualties reported . |
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
#!pip install vaderSentiment
# apply the analyzer to each sentence
# and attach the scores to the sents_polarity table
vader_cols = sents_polarity.sent_str.apply(analyser.polarity_scores)\
.apply(lambda x: pd.Series(x))
vader = pd.concat([sents_polarity, vader_cols], axis=1)
Positive and Negative
# using a rolling method to smooth the lines
# using 1/5 of the values
w = int(vader.shape[0] / 5)
vader[['pos','neg']].rolling(w).mean().plot(figsize=(25,5));
Neutral
vader[['neu']].rolling(w).mean().plot(figsize=(25,5));
Compound: a normalized composite of the positive and negative scores
vader[['compound']].rolling(w).mean().plot(figsize=(25,5));
# https://towardsdatascience.com/geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map-98e01944b972
# https://madhuramiah.medium.com/geographic-plotting-with-python-folium-2f235cc167b7
n_terms = 4000
n_topics = 10
max_iter = 5
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
%matplotlib inline
We use scikit-learn's CountVectorizer to convert our corpus of event notes into a document-term vector space of word counts.
# Create a vector space
tfv = CountVectorizer(max_features=n_terms, stop_words='english')
tf = tfv.fit_transform(TOKEN.token_str)
TERMS = tfv.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
# Generate Model
# Scikit Learn's LatentDirichletAllocation algorithm and extract the THETA and PHI tables.
lda = LDA(n_components=n_topics, max_iter=max_iter, learning_offset=50., random_state=0)
THETA = pd.DataFrame(lda.fit_transform(tf), index=TOKEN.index)
THETA.columns.name = 'topic_id'
THETA.sample(10).style.background_gradient()
| topic_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| country | year | event_date | admin1 | admin2 | token_id | |||||||||||
| Syria | 2017 | 2017-11-18 00:00:00 | Aleppo | Afrin | SYR38767 | 20 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.550000 | 0.050000 |
| 2020 | 2020-02-14 00:00:00 | Aleppo | Jebel Saman | SYR75478 | 155 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | |
| 2017 | 2017-11-27 00:00:00 | Rural Damascus | Duma | SYR16752 | 43 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | |
| 2019 | 2019-05-21 00:00:00 | Hama | As-Suqaylabiyah | SYR60990 | 15 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.550000 | 0.050000 | |
| 2017 | 2017-03-21 00:00:00 | Damascus | Damascus | SYR4461 | 22 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | |
| 2019 | 2019-07-21 00:00:00 | Hama | Muhradah | SYR64647 | 9 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.550000 | 0.050000 | |
| 2019-04-26 00:00:00 | Hama | As-Suqaylabiyah | SYR59481 | 16 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.550000 | 0.050000 | 0.050000 | 0.050000 | ||
| 2017 | 2017-12-27 00:00:00 | Idleb | Idleb | SYR17903 | 11 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | |
| 2017-09-18 00:00:00 | Rural Damascus | Rural Damascus | SYR12827 | 26 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | ||
| 2017-09-29 00:00:00 | Rural Damascus | Rural Damascus | SYR40387 | 6 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.550000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 | 0.050000 |
PHI = pd.DataFrame(lda.components_, columns=TERMS)
PHI.index.name = 'topic_id'
PHI.columns.name = 'term_str'
PHI.T.head().style.background_gradient()
| topic_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| term_str | ||||||||||
| 000 | 2.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 |
| 01 | 0.100000 | 0.100000 | 4.099995 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100005 |
| 02 | 0.100000 | 0.100000 | 7.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 |
| 03 | 0.100000 | 0.100000 | 0.100000 | 4.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 |
| 04 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 0.100000 | 4.100000 | 0.100000 |
Get Top Terms per Topic
TOPICS = PHI.stack().to_frame().rename(columns={0:'weight'})\
.groupby('topic_id')\
.apply(lambda x:
x.weight.sort_values(ascending=False)\
.head(10)\
.reset_index()\
.drop('topic_id', axis=1)\
.term_str)
TOPICS.head()
| term_str | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| topic_id | ||||||||||
| 0 | syrian | clashes | village | unidentified | civilian | 10 | ground | fighter | al | total |
| 1 | city | targeted | vicinity | bombs | resulted | accompanied | shot | neighborhood | information | border |
| 2 | reported | 2020 | artillery | shelling | civilians | killing | military | army | raids | including |
| 3 | countryside | warplanes | 2019 | casualties | rebel | militias | helicopters | armed | outskirts | fired |
| 4 | airstrikes | killed | areas | province | shells | western | rocket | rebels | rural | controlled |
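The groupby-apply chain above just pulls the highest-weighted terms per topic row. A minimal sketch of the same idea with `nlargest` on a toy topic-term matrix (the weights and terms here are made up, not the actual PHI values):

```python
import pandas as pd

# Toy topic-term weight matrix standing in for PHI (rows: topics, columns: terms).
phi = pd.DataFrame(
    [[5.0, 0.1, 2.0, 0.1],
     [0.1, 4.0, 0.1, 3.0]],
    index=pd.Index([0, 1], name='topic_id'),
    columns=pd.Index(['clashes', 'shelling', 'village', 'artillery'], name='term_str'),
)

# For each topic, take the 2 highest-weighted terms.
top_terms = phi.apply(lambda row: list(row.nlargest(2).index), axis=1)
print(top_terms[0])  # ['clashes', 'village']
print(top_terms[1])  # ['shelling', 'artillery']
```

`nlargest` avoids the full sort-head-reset-drop chain when all you need is the top-k index labels.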
TOPICS['label'] = TOPICS.apply(lambda x: str(x.name) + ' ' + ' '.join(x), 1)
TOPICS['doc_weight_sum'] = THETA.sum()
TOPICS.sort_values('doc_weight_sum', ascending=True)\
.plot.barh(y='doc_weight_sum', x='label', figsize=(5,10));
Sorted in descending order of topic weight for Idleb province, which seems to be the most volatile province.
#THETA.head()
topic_cols = [t for t in range(n_topics)]
ADMIN1 = THETA.groupby('admin1')[topic_cols].mean().T
ADMIN1.index.name = 'topic_id'
#ADMIN1.T
ADMIN1['topterms'] = TOPICS[[i for i in range(10)]].apply(lambda x: ' '.join(x), 1)
ADMIN1.sort_values('Idleb', ascending=False).style.background_gradient()
| admin1 | Al-Hasakeh | Aleppo | Ar-Raqqa | As-Sweida | Damascus | Dar'a | Deir-ez-Zor | Hama | Homs | Idleb | Lattakia | Quneitra | Rural Damascus | topterms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| topic_id | ||||||||||||||
| 6 | 0.099603 | 0.103162 | 0.096709 | 0.100650 | 0.095354 | 0.103529 | 0.097271 | 0.107345 | 0.104562 | 0.109705 | 0.111218 | 0.097742 | 0.097916 | fatalities injuries area barrel conducted movement man district 18 mortar |
| 2 | 0.110244 | 0.107719 | 0.104147 | 0.093333 | 0.108565 | 0.106748 | 0.105421 | 0.107420 | 0.105067 | 0.109659 | 0.109348 | 0.117663 | 0.109354 | reported 2020 artillery shelling civilians killing military army raids including |
| 3 | 0.100628 | 0.106491 | 0.100090 | 0.107967 | 0.097336 | 0.104602 | 0.097653 | 0.109108 | 0.096481 | 0.106023 | 0.111752 | 0.099734 | 0.097334 | countryside warplanes 2019 casualties rebel militias helicopters armed outskirts fired |
| 8 | 0.098449 | 0.100705 | 0.096372 | 0.095772 | 0.098107 | 0.103410 | 0.108921 | 0.107908 | 0.091936 | 0.103631 | 0.103739 | 0.114343 | 0.106899 | town shelled fighters al russian coded southern using militia clashed |
| 7 | 0.103833 | 0.104232 | 0.104485 | 0.125854 | 0.107354 | 0.104840 | 0.099458 | 0.107495 | 0.104731 | 0.099903 | 0.109081 | 0.109695 | 0.103474 | regime northern turkish place eastern factions positions opposition suspected pro |
| 9 | 0.103577 | 0.099437 | 0.099977 | 0.107968 | 0.099978 | 0.095899 | 0.106077 | 0.091815 | 0.100522 | 0.096836 | 0.093323 | 0.089774 | 0.100307 | carried took near unknown number gunmen control injuring following exchange |
| 5 | 0.096013 | 0.094999 | 0.097048 | 0.088455 | 0.097996 | 0.094111 | 0.090433 | 0.094178 | 0.098502 | 0.095194 | 0.092521 | 0.095086 | 0.096494 | forces group injured air dropped heavy led north missiles 17 |
| 0 | 0.100244 | 0.094801 | 0.098963 | 0.105528 | 0.097116 | 0.097687 | 0.102412 | 0.094253 | 0.107256 | 0.094213 | 0.093590 | 0.100398 | 0.097011 | syrian clashes village unidentified civilian 10 ground fighter al total |
| 4 | 0.091397 | 0.095554 | 0.099752 | 0.086016 | 0.094033 | 0.095660 | 0.095684 | 0.092528 | 0.093451 | 0.093814 | 0.086912 | 0.087782 | 0.097173 | airstrikes killed areas province shells western rocket rebels rural controlled |
| 1 | 0.096013 | 0.092899 | 0.102457 | 0.088455 | 0.104161 | 0.093515 | 0.096669 | 0.087951 | 0.097492 | 0.091022 | 0.088515 | 0.087782 | 0.094038 | city targeted vicinity bombs resulted accompanied shot neighborhood information border |
Comparing Idleb, which is dominated by hardline Arab opposition factions (largely backed by Turkey), with Deir-ez-Zor province, where the opposition is Kurdish-dominated (largely backed by the US).
import plotly_express as px
px.scatter(ADMIN1.reset_index(), 'Idleb', 'Deir-ez-Zor', hover_name='topterms', text='topic_id')\
.update_traces(mode='text')
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize
import matplotlib.pyplot as plt
def plot_tree(tree, labels):
    # A single subplots() call is enough; a separate plt.figure() would create an empty figure.
    fig, axes = plt.subplots(figsize=(14, 15))
    dendrogram = sch.dendrogram(tree, labels=labels, orientation="left", truncate_mode='lastp')
    plt.tick_params(axis='both', which='major', labelsize=14)
SIMS = pdist(normalize(PHI), metric='euclidean')
TREE = sch.linkage(SIMS, method='ward')
labels = ["{}: {}".format(a,b) for a, b in zip(ADMIN1.index, ADMIN1.topterms.tolist())]
plot_tree(TREE, labels)
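The Ward linkage built above can also be cut into flat cluster assignments with `scipy`'s `fcluster` (the `AgglomerativeClustering` import above goes unused). A minimal sketch on toy 2-D points, not the actual PHI matrix:

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Two tight point groups; Ward linkage should separate them cleanly.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
tree = sch.linkage(pdist(points, metric='euclidean'), method='ward')

# Cut the dendrogram into 2 flat clusters.
labels = sch.fcluster(tree, t=2, criterion='maxclust')
print(labels)  # the first two points share one label, the last two share another
```

Applied to `TREE`, this would give each topic a cluster id instead of (or alongside) the dendrogram view.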
VOCAB.head()
| term_str | n | num | stop | p_stem | pos_max | term_rank | p | df | idf | tfidf_mean | tfidf_sum | tfidf_median | tfidf_max | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| term_id | ||||||||||||||
| 0 | 3331 | 0 | 0 | . | 1 | 0.124701 | 944 | 0.000460 | 0.001623 | 1.531641 | 0.001379 | 0.039084 | ||
| 1 | 01 | 2 | 1 | 0 | 01 | CD | 110 | 0.000075 | 2 | 2.674402 | 2.674402 | 5.348804 | 2.674402 | 2.674402 |
| 2 | 02 | 7 | 1 | 0 | 02 | CD | 105 | 0.000262 | 7 | 2.130334 | 2.130334 | 14.912336 | 2.130334 | 2.130334 |
| 3 | 03 | 4 | 1 | 0 | 03 | CD | 108 | 0.000150 | 4 | 2.373372 | 2.373372 | 9.493487 | 2.373372 | 2.373372 |
| 4 | 04 | 4 | 1 | 0 | 04 | CD | 108 | 0.000150 | 4 | 2.373372 | 2.373372 | 9.493487 | 2.373372 | 2.373372 |
# Create a LIB dataframe with event metadata
library = df_raw[['country','admin1','event_date']]
LIB = pd.DataFrame(library.groupby('admin1')['event_date'].unique())
# Export files to csv
LIB.to_csv('LIB.csv')
VOCAB.to_csv('VOCAB.csv')
TOKEN.to_csv('TOKEN.csv')
Logistic Regression with train/test split
In Python, I compare three classification approaches. First, a baseline model, the simplest, built with scikit-learn. Second, I leverage spaCy for preprocessing and add a TF-IDF vectorizer to the scikit-learn classification pipeline. Last, I implement a neural network in Keras, a high-level API that sits on top of TensorFlow.
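Before comparing models, it helps to know the floor: with two near-balanced classes, a majority-class baseline barely beats a coin flip. A quick sketch using the class counts reported further below (1786 vs. 1769):

```python
# Majority-class baseline: always predict the more frequent label.
counts = {1: 1786, 0: 1769}
baseline_accuracy = max(counts.values()) / sum(counts.values())
print(f"{baseline_accuracy:.4f}")  # ~0.5024
```

Any classifier scoring near 0.50 on this task has learned nothing; the ~0.93 accuracies reported later are well clear of that floor.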
df_raw.columns
Index(['region', 'country', 'year', 'event_date', 'source', 'admin1', 'admin2',
'admin3', 'location', 'event_type', 'sub_event_type', 'interaction',
'fatalities', 'timestamp', 'actor1', 'assoc_actor_1', 'actor2',
'assoc_actor_2', 'notes', 'longitude', 'latitude', 'geo_precision',
'inter1', 'inter2', 'event_id_cnty'],
dtype='object')
df_raw['sub_event_type'].unique()
array(['Abduction/forced disappearance', 'Armed clash',
'Shelling/artillery/missile attack', 'Attack',
'Remote explosive/landmine/IED', 'Air/drone strike',
'Change to group/activity', 'Looting/property destruction',
'Arrests', 'Violent demonstration', 'Peaceful protest', 'Grenade',
'Other', 'Headquarters or base established',
'Disrupted weapons use', 'Suicide bomb',
'Excessive force against protesters', 'Protest with intervention',
'Non-state actor overtakes territory', 'Mob violence', 'Agreement',
'Sexual violence', 'Non-violent transfer of territory',
'Government regains territory', 'Chemical weapon'], dtype=object)
event_list = ['Non-state actor overtakes territory','Government regains territory']
filtered_df = df_raw[df_raw['sub_event_type'].isin(event_list)].copy()
# filtered_df['sub_event_type'] = filtered_df['sub_event_type']\
# .map({'Non-state actor overtakes territory': 1, 'Government regains territory': 0})
filtered_df['sub_event_type'] = filtered_df['sub_event_type']\
.apply(lambda x: 0 if x=='Non-state actor overtakes territory' else 1)
<ipython-input-79-cef13c77fde9>:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
filtered_df['sub_event_type'].head()
filtered_df['sub_event_type'].value_counts()
1    1786
0    1769
Name: sub_event_type, dtype: int64
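The `SettingWithCopyWarning` above arises because the filtered frame may be a view of `df_raw`. Taking an explicit `.copy()` before assigning makes the intent unambiguous. A minimal sketch on a toy frame (the two-row example data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'sub_event_type': ['Government regains territory',
                                      'Non-state actor overtakes territory',
                                      'Government regains territory']})

# .copy() makes the filtered frame independent, so the later assignment is safe.
filtered = df[df['sub_event_type'].str.startswith('G')].copy()
filtered['sub_event_type'] = filtered['sub_event_type'] \
    .apply(lambda x: 0 if x == 'Non-state actor overtakes territory' else 1)
print(filtered['sub_event_type'].tolist())  # [1, 1]
```

The alternative is to assign through `df.loc[mask, col]` on the original frame, as the warning message suggests.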
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
import string
# Create our list of punctuation marks
punctuations = string.punctuation
# Create our list of stopwords
nlp = spacy.load('en_core_web_sm')
stop_words = STOP_WORDS  # used by spacy_tokenizer below
# Load English tokenizer, tagger, parser, NER and word vectors
parser = English()
# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)
    # Lemmatizing each token and converting each token into lowercase
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    # Removing stop words and punctuation
    mytokens = [word for word in mytokens if word not in stop_words and word not in punctuations]
    # return preprocessed list of tokens
    return mytokens
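The filtering steps in `spacy_tokenizer` can be illustrated without spaCy. A minimal, dependency-free sketch of the same lowercase-strip-filter logic (the tiny stop-word set here is illustrative; spaCy's `STOP_WORDS` is far larger, and spaCy also lemmatizes):

```python
import string

# Tiny illustrative stop-word set; spaCy's STOP_WORDS has several hundred entries.
toy_stop_words = {'the', 'on', 'in', 'of'}
punctuations = set(string.punctuation)

def simple_tokenizer(sentence):
    # Lowercase and strip each whitespace-split token,
    # then drop stop words and bare punctuation tokens.
    tokens = [t.lower().strip() for t in sentence.split()]
    return [t for t in tokens
            if t not in toy_stop_words and t not in punctuations]

print(simple_tokenizer("Clashes in the village ."))  # ['clashes', 'village']
```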
# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()
#bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
bow_vector = CountVectorizer(max_features=n_terms, stop_words='english')
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
from sklearn.model_selection import train_test_split
X = filtered_df['notes'] # the features to analyze
ylabels = filtered_df['sub_event_type'] # the labels
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)
X_train.head()
11706    On 04 February, 2020, regime forces captured U...
75869    Syrian democratic forces captured Safsaf villa...
64388    Clashes took place between Islamic State again...
52754    Regime forces reportedly established control o...
74450    Violent clashes between Lions of the East Army...
Name: notes, dtype: object
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
('vectorizer', bow_vector),
('classifier', classifier)])
# model generation
pipe.fit(X_train, y_train)
Pipeline(steps=[('cleaner', <__main__.predictors object at 0x0000022931950820>),
('vectorizer',
CountVectorizer(max_features=4000, stop_words='english')),
('classifier', LogisticRegression())])
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
confusion_matrix(y_test, predicted)
array([[468, 62],
[ 17, 520]], dtype=int64)
# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
Logistic Regression Accuracy: 0.9259606373008434
Logistic Regression Precision: 0.8934707903780069
Logistic Regression Recall: 0.9683426443202979
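These figures can be checked by hand against the confusion matrix printed above, reading rows as the true class and columns as the predicted class:

```python
# Confusion matrix from above: [[468, 62], [17, 520]]
# rows = true label (0, 1), cols = predicted label (0, 1).
tn, fp = 468, 62
fn, tp = 17, 520

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.9260 precision=0.8935 recall=0.9683
```

The high recall with somewhat lower precision matches the matrix: the model misses few class-1 events (17 false negatives) but mislabels more class-0 events as class 1 (62 false positives).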
from sklearn.model_selection import train_test_split
sentences = filtered_df['notes'].values
y = filtered_df['sub_event_type'].values
sentences_train, sentences_test, y_train, y_test = train_test_split(
sentences, y, test_size=0.25, random_state=1000)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
X_train
<2666x4527 sparse matrix of type '<class 'numpy.int64'>' with 85750 stored elements in Compressed Sparse Row format>
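The sparse matrix above is simply token counts over the vocabulary fitted on the training sentences. A minimal sketch on a toy two-document corpus (the sentences here are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["shelling hit the village",
          "clashes near the village"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # ['clashes', 'hit', 'near', 'shelling', 'the', 'village']
print(X.shape)                         # (2, 6): 2 documents x 6 vocabulary terms
print(X.toarray())
```

Fitting on the training split only, then transforming both splits, mirrors what is done above and keeps test-set terms out of the vocabulary.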
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=1000)  # raise max_iter; the default hits the lbfgs iteration limit (warning below)
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9426321709786277
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
from keras.models import Sequential
from keras import layers
input_dim = X_train.shape[1] # Number of features
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
import tensorflow as tf; print(tf.__version__)
2.4.1
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 10)                45280
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11
=================================================================
Total params: 45,291
Trainable params: 45,291
Non-trainable params: 0
_________________________________________________________________
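The parameter counts in the summary follow directly from the layer shapes: a Dense layer holds (inputs + 1 bias) × units weights, and here `input_dim` is the 4,527-term vocabulary:

```python
# Dense layer parameter count: (inputs + 1 bias) * units.
input_dim = 4527                        # vocabulary size from the CountVectorizer
dense_params   = (input_dim + 1) * 10   # first hidden layer, 10 units
dense_1_params = (10 + 1) * 1           # output layer, 1 sigmoid unit

print(dense_params, dense_1_params, dense_params + dense_1_params)
# 45280 11 45291
```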
history = model.fit(X_train, y_train,
epochs=100,
verbose=False,
validation_data=(X_test, y_test),
batch_size=10)
from keras.backend import clear_session
clear_session()
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 1.0000
Testing Accuracy: 0.9303
import matplotlib.pyplot as plt
plt.style.use('ggplot')
def plot_history(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
plot_history(history)